A simple sketching algorithm for entropy estimation over streaming data

نویسندگان

  • Peter Clifford
  • Ioana Cosma
چکیده

We consider the problem of approximating the empirical Shannon entropy of a highfrequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is the Rényi entropy that depends on a constant α. This quantity can be estimated efficiently and unbiasedly from a low-dimensional synopsis called an α-stable data sketch via the method of compressed counting. An approximation to the Shannon entropy can be obtained from the Rényi entropy by taking α sufficiently close to 1. However, practical guidelines for parameter calibration with respect to α are lacking. We avoid this problem by showing that the random variables used in estimating the Rényi entropy can be transformed to have a proper distributional limit as α approaches 1: the maximally skewed, strictly stable distribution with α = 1 defined on the entire real line. We propose a family of asymptotically unbiased log-mean estimators of the Shannon entropy, indexed by a constant ζ > 0, that can be computed in a single-pass algorithm to provide an additive approximation. We recommend the log-mean estimator with ζ = 1 that has exponentially decreasing tail bounds on the error probability, asymptotic relative efficiency of 0.932, and near-optimal computational complexity. Appearing in Proceedings of the 16 International Conference on Artificial Intelligence and Statistics (AISTATS) 2013, Scottsdale, AZ, USA. Volume 31 of JMLR: W&CP 31. Copyright 2013 by the authors.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Sketching and Streaming High-Dimensional Vectors

A sketch of a dataset is a small-space data structure supporting some prespecified set of queries (and possibly updates) while consuming space substantially sublinear in the space required to actually store all the data. Furthermore, it is often desirable, or required by the application, that the sketch itself be computable by a small-space algorithm given just one pass over the data, a so-call...

متن کامل

Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features

Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...

متن کامل

Streaming Anomaly Detection Using Randomized Matrix Sketching

Data is continuously being generated from sources such as machines, network traffic, application logs, etc. Timely and accurate detection of anomalies in massive data streams have important applications in preventing machine failures, intrusion detection, and dynamic load balancing. In this paper, we introduce a new anomaly detection algorithm, which can detect anomalies in a streaming fashion ...

متن کامل

Using an Evaluator Fixed Structure Learning Automata in Sampling of Social Networks

Social networks are streaming, diverse and include a wide range of edges so that continuously evolves over time and formed by the activities among users (such as tweets, emails, etc.), where each activity among its users, adds an edge to the network graph. Despite their popularities, the dynamicity and large size of most social networks make it difficult or impossible to study the entire networ...

متن کامل

A simple sketching algorithm for entropy estimation

We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream when space limitations make exact computation infeasible. It is known that αdependent quantities such as the Rényi and Tsallis entropies can be estimated efficiently and unbiasedly from low-dimensional α-stable data sketches. An approximation to the Shannon entropy can be obtained from either ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013